A Tidy Framework and Infrastructure to Systematically Assemble Spatio-temporal Indexes from Multivariate Data

H. Sherry Zhang

Joint Statistical Meeting
Portland, Oregon

Aug 7, 2024

Indexes

Tokyo 2020 sport climbing 🧗

Boulder: 4m wall, 3 problems in final

Lead: 15m wall, 1 problem

Speed: 15m wall, always the same

Three disciplines, one champion

In Tokyo 2020, this is how athletes are scored in the final:

  • In each discipline, athletes are ranked from 1 to 8 (best to worst)
  • The final score is the product of the three discipline ranks; the lowest score wins

How the rank is determined is also different in each discipline:

  • boulder: number of tops and then number of zones
  • lead: number of holds reached (1 - 40) before falling
  • speed: fastest time to reach the top among three attempts

Who can win the gold medal?

Athlete              Country   Speed  Boulder  Lead  Total  Rank
Janja Garnbret       Slovenia      5        1     1      5  1
Miho Nonaka          Japan         3        3     5     45  2
Akiyo Noguchi        Japan         4        4     4     64  3
Aleksandra Miroslaw  Poland        1        8     8     64  4
Jane Doe #1          xxx           8        2     2     32  2?
Jane Doe #2          xxx           1        7     4     28  2?
Jane Doe #3          xxx           6        1     5     30  2?

Looks like:

  • Being top in one discipline can get you pretty far (Aleksandra Miroslaw)
  • (1, 4, 7) gets you higher than (2, 2, 8): a product of 28 beats 32 (Jane Doe #2 vs. Jane Doe #1)
  • (1, 4, 7) can get you higher than (1, 5, 6) (Jane Doe #2 vs. Jane Doe #3)
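
The product rule is easy to check by hand. A minimal base R sketch, using the discipline ranks from the table; the data frame and column names are illustrative, and tie-breaking between equal totals is not modelled:

```r
# Discipline ranks copied from the table (lower is better)
climbers <- data.frame(
  athlete = c("Janja Garnbret", "Miho Nonaka", "Akiyo Noguchi",
              "Aleksandra Miroslaw", "Jane Doe #1", "Jane Doe #2",
              "Jane Doe #3"),
  speed   = c(5, 3, 4, 1, 8, 1, 6),
  boulder = c(1, 3, 4, 8, 2, 7, 1),
  lead    = c(1, 5, 4, 8, 2, 4, 5)
)

# Index: product of the three discipline ranks
climbers$total <- with(climbers, speed * boulder * lead)

# Sort from best (smallest product) to worst
climbers[order(climbers$total), c("athlete", "total")]
```

Each Jane Doe has a total below 45, so any one of them would displace the silver medallist if she replaced a finalist.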

Ranking sport climbing athletes is an index construction problem …

… where multivariate information is summarised into a single number 🏅

What’s the problem with indexes?

Inspired by tidymodels

A closer look at a class of drought indexes

The pipeline design (9 modules)

Data with spatial (\(\mathbf{s}\)) and temporal (\(\mathbf{t}\)) dimensions: \[x_j(s;t)\]

  • Temporal processing: \(f[x_{sj}(t)]\)
  • Spatial processing: \(g[x_{tj}(s)]\)


  • Variable transformation: \(T[x_j(s;t)]\)
  • Scaling: \([x_j(s;t)- \alpha]/\gamma\)
  • Distribution fit: \(F[x_j(s;t)]\)
  • Normalising: \(\Phi^{-1}[x_j(s;t)]\)


  • Dimension reduction: \(h[\mathbf{x}(s;t)]\)
  • Benchmarking: \(u[x(s;t)]\)
  • Simplification
\[\begin{cases} C_0 & c_1 \leq x(\mathbf{s};\mathbf{t}) < c_0 \\ C_1 & c_2 \leq x(\mathbf{s};\mathbf{t}) < c_1 \\ \cdots \\ C_z & c_z \leq x(\mathbf{s};\mathbf{t}) \end{cases}\]
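
As a rough illustration of the simplification step, base R's cut() maps a continuous index onto ordered categories. The cut-offs and labels below are assumptions chosen for illustration (loosely following common SPI drought classes), not values taken from the pipeline:

```r
# Toy index values for four time points (hypothetical)
index <- c(-2.3, -1.7, -1.2, 0.4)

# Hypothetical cut-offs and category labels
cutoffs <- c(-Inf, -2, -1.5, -1, Inf)
labels  <- c("extreme", "severe", "moderate", "none")

# right = FALSE makes each interval closed on the left, [lower, upper),
# matching the "c <= x < c'" form of the cases
category <- cut(index, breaks = cutoffs, labels = labels, right = FALSE)

data.frame(index, category)
```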

Software design

DATA |>
  module1(...) |>
  module2(...) |>
  module3(...) |>
  ...

dimension_reduction(V1 = aggregate_linear(...))
dimension_reduction(V2 = aggregate_geometrical(...))
dimension_reduction(V3 = aggregate_manual(...))

Each aggregate_*() function can be evaluated on its own as a standalone recipe, before it is evaluated with the data in the dimension reduction module:

aggregate_manual(~x1 + x2)
[1] "aggregate_manual"
attr(,"formula")
[1] "x1 + x2"
attr(,"class")
[1] "dim_red"
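
The recipe-then-evaluate idea can be sketched in base R. The toy_* functions below are hypothetical stand-ins written for illustration; they mimic the printed structure above but are not the tidyindex implementation:

```r
# Hypothetical stand-in for a recipe constructor: stores the formula's
# right-hand side as text and touches no data
toy_aggregate_manual <- function(formula) {
  structure(
    "aggregate_manual",
    formula = deparse(formula[[2]]),
    class = "toy_dim_red"
  )
}

# Hypothetical stand-in for the module: evaluates each named recipe
# against the columns of `data` and appends the result as a new column
toy_dimension_reduction <- function(data, ...) {
  recipes <- list(...)
  for (name in names(recipes)) {
    expr <- parse(text = attr(recipes[[name]], "formula"))[[1]]
    data[[name]] <- eval(expr, envir = data)
  }
  data
}

toy_data <- data.frame(x1 = c(1, 2, 3), x2 = c(10, 20, 30))
recipe <- toy_aggregate_manual(~ x1 + x2)   # no data needed yet
result <- toy_dimension_reduction(toy_data, index = recipe)
result
```

Keeping the recipe separate from the data means the same specification can be inspected, swapped, or reused across data sets before anything is computed.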

Confidence interval in the SPI

One hundred bootstrap replicates are drawn from the aggregated precipitation series to estimate the gamma parameters and compute the SPI for the Texas Post Office station in Queensland.

DATA %>%
  # aggregate monthly precipitation 
  # with a 24-month window
  aggregate(
    .var = prcp, .scale = 24
    ) %>%
  # fit a gamma distribution to 
  # obtain the probability value
  # [0, 1]
  dist_fit(
    .dist = gamma(), .var = .agg, 
    .n_boot = 100
    ) %>%
  # use the inverse CDF to 
  # convert into z-score
  augment(.var = .agg)
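
For intuition, the same three steps can be approximated in base R. This is a sketch under stated assumptions, not what the package does internally: the precipitation series is simulated, the gamma parameters come from a method-of-moments fit, and the bootstrap resamples the aggregated values as if they were independent, which ignores the serial dependence a rolling sum introduces.

```r
set.seed(2024)

# Simulated monthly precipitation (mm); a stand-in for real station data
prcp <- rgamma(240, shape = 2, rate = 0.05)

# Temporal processing: rolling sum over a 24-month window
scale <- 24
agg <- stats::filter(prcp, rep(1, scale), sides = 1)
agg <- as.numeric(agg[!is.na(agg)])

# Distribution fit: method-of-moments gamma estimates (an assumption;
# L-moments or maximum likelihood are common alternatives)
fit_gamma_mom <- function(x) {
  m <- mean(x)
  v <- var(x)
  c(shape = m^2 / v, rate = m / v)
}

# Normalising: gamma CDF followed by the standard normal quantile function
spi_from_fit <- function(x, par) {
  qnorm(pgamma(x, shape = par[["shape"]], rate = par[["rate"]]))
}

# Bootstrap: refit on each resampled series, then score the observed series
n_boot <- 100
spi_boot <- replicate(n_boot, {
  par <- fit_gamma_mom(sample(agg, replace = TRUE))
  spi_from_fit(agg, par)
})

# Pointwise 80% and 95% intervals across the bootstrap replicates
ci <- t(apply(spi_boot, 1, quantile, probs = c(0.025, 0.1, 0.9, 0.975)))
head(round(ci, 2))
```

Each row of ci corresponds to one month, so plotting its columns against time gives interval bands around the index.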

Confidence interval in the SPI

80% and 95% confidence intervals of the Standardized Precipitation Index (SPI-24) for the Texas Post Office station in Queensland, Australia. The dashed line at SPI = -2 marks an extreme drought as defined by the SPI. From 2019 to 2020, most of the confidence interval sits below this line and is wide relative to other periods. This suggests that, while the station was clearly suffering a drastic drought, there is considerable uncertainty in quantifying its severity, given how extreme the event was.

Global Gender Gap Index

Summary

A data pipeline comprising nine modules designed for the construction and analysis of indexes within the tidy framework.

Advantages?

  • quantify uncertainties, and
  • assess indexes’ robustness,

and more!

Reference

Slides created with Quarto, available at

https://sherry-jsm2024.netlify.app/


tidyindex package: https://github.com/huizezhang-sherry/tidyindex